
API: consistent NaN treatment for pyarrow dtypes #61732


Open
wants to merge 23 commits into
base: main

Conversation

jbrockmendel
Member

@jbrockmendel jbrockmendel commented Jun 28, 2025

This is the third of several POCs stemming from the discussion in #61618 (see #61708, #61716). The main goal is to see how invasive it would be.

Specifically, this changes the behavior of pyarrow floating dtypes to treat NaN as distinct from NA in the constructors and __setitem__ (xref #32265). The same applies to to_numpy and .values.

Notes:

  • This makes the decision to treat NaNs as close-enough to NA when a user explicitly asks for a pyarrow integer dtype. I think this is the right API, but won't check the box until there's a consensus. Changed this following Matt's opinion.
  • Failing tests locally: 113, then 89, then 9, now 0. Most of these are in json, sql, or test_EA_types (which is about csv round-tripping).
  • Finding the mask to pass to pa.array needs optimization.
  • The kludge in NDFrame.where is ugly and fragile. Fixed.
  • Need to double-check the new expected in the rank test. Maybe re-write the test with NA instead of NaN?
  • Do we change to_numpy() behavior to not convert NAs to NaNs? This would be needed to make the test_setitem_frame_2d_values tests pass.
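
As a rough illustration of the constructor change and of the mask computation mentioned in the notes, here is a toy, pure-Python sketch. `NA`, `build_mask`, and the `nan_is_na` parameter are hypothetical stand-ins, not pandas or pyarrow code; the actual change works on arrays and passes the resulting mask to `pa.array`.

```python
import math

# Hypothetical sentinel standing in for pd.NA in this sketch.
NA = None


def build_mask(values, nan_is_na):
    """Return a list of booleans, True where an entry counts as missing.

    NA is always missing; NaN is missing only when nan_is_na is True.
    """
    mask = []
    for v in values:
        if v is NA:
            mask.append(True)
        elif isinstance(v, float) and math.isnan(v):
            mask.append(nan_is_na)
        else:
            mask.append(False)
    return mask


data = [1.0, float("nan"), NA]

# Roughly the constructor behavior before this change: NaN and NA both missing.
legacy_mask = build_mask(data, nan_is_na=True)

# The behavior proposed here: only NA is missing; NaN stays a float value.
distinct_mask = build_mask(data, nan_is_na=False)
```

The per-element loop is only for readability; a real implementation would need a vectorized check.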

@jbrockmendel
Member Author

@mroeschke when convenient I'd like to get your thoughts before getting this working. It looks pretty feasible.

@mroeschke
Member

Generally +1 in this direction. Glad to see the changes to make this work are fairly minimal

Member

@rhshadrach rhshadrach left a comment

+1 as well; this is nice.

@Dr-Irv
Contributor

Dr-Irv commented Jul 21, 2025

Not able to judge the implementation, but I'm +1 on the concept.

@simonjayhawkins simonjayhawkins added the PDEP missing values Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint label Jul 22, 2025
Comment on lines +487 to +488
dtype, na_value = to_numpy_dtype_inference(
self, dtype, na_value, hasna, is_pyarrow=False
Member

This change means to use object dtype instead of converting NA to NaNs?

We initially did that for the masked arrays conversion to numpy, but then changed it to use NaNs, because constantly getting object dtype was too annoying (there is some issue discussing this IIRC).

Member Author

Yes.
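
The trade-off in this exchange can be sketched with plain Python lists. This is a toy model under stated assumptions, not the pandas implementation: `NA`, `to_floats`, and `to_objects` are hypothetical names, and lists stand in for NumPy arrays.

```python
import math


class _NAType:
    """Hypothetical missing-value sentinel standing in for pd.NA."""

    def __repr__(self):
        return "<NA>"


NA = _NAType()


def to_floats(values, mask):
    """Replace masked entries with NaN: uniform floats, but NA and NaN merge."""
    return [float("nan") if m else v for v, m in zip(values, mask)]


def to_objects(values, mask):
    """Keep masked entries as NA: lossless, but the output mixes types."""
    return [NA if m else v for v, m in zip(values, mask)]


values = [1.0, float("nan"), 0.0]  # underlying float buffer
mask = [False, False, True]        # only the last entry is missing

as_floats = to_floats(values, mask)
as_objects = to_objects(values, mask)

# After the float conversion, the genuine NaN and the former NA look identical.
nan_positions = [i for i, v in enumerate(as_floats) if math.isnan(v)]

# The object conversion still lets a caller tell them apart.
na_positions = [i for i, v in enumerate(as_objects) if v is NA]
```

The float path keeps a single numeric type at the cost of conflating the two kinds of value; the object path preserves the distinction at the cost of a mixed-type result.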

@jorisvandenbossche
Member

While I am personally in favor of distinguishing NaN and NA, I think most of the changes here involve distinguishing NaN when constructing the arrays? (so e.g. constructing the pyarrow-based EA from user input like numpy arrays?)

Personally, I think that is a change we should only make after making those dtypes the default, and probably even years after that after a very long deprecation process.
(currently everyone who is creating pandas DataFrames from numpy data assumes that the NaNs in the numpy data are considered missing. IMO that is a behaviour that we will have to keep (for a long time) even if we distinguish NaN and NA)

@jbrockmendel
Member Author

> I think most of the changes here involve distinguishing NaN when constructing the arrays?

Yes. Constructors (which affect read_csv) and __setitem__ are most of this.

> I think that is a change we should only make after making those dtypes the default, and probably even years after that after a very long deprecation process.

My current thought (will bring up on today's dev call) is that we should add a global flag that offers never-distinguish (see #61708) as the default and always-distinguish (this PR) as opt-in.

@jbrockmendel jbrockmendel marked this pull request as ready for review July 31, 2025 16:36
@jbrockmendel
Member Author

Based on last week's dev call, I am adapting this and #61708 from POCs to real PRs. This implements a global flag "mode.nan_is_na" (default True) to choose which behavior we want.

This PR only implements this for ArrowEA. #61708 will do the same for the numpy-nullables. (I have a branch trying to do it all at once and it is getting ungainly). A third PR will add tests for the various issues this closes.
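
To make the flag idea concrete, here is a minimal, self-contained sketch of how a global option could switch between the two behaviors. Only the option name "mode.nan_is_na" and its default of True come from the comment above; `_options`, `option_context`, and `is_missing` are hypothetical helpers written for this illustration and are not the code in this PR.

```python
import math
from contextlib import contextmanager

# Hypothetical option store; pandas keeps its own registry for real options.
_options = {"mode.nan_is_na": True}  # default True, per the comment


@contextmanager
def option_context(name, value):
    """Temporarily set an option, restoring the old value on exit."""
    old = _options[name]
    _options[name] = value
    try:
        yield
    finally:
        _options[name] = old


def is_missing(value):
    """None always counts as missing; NaN only when mode.nan_is_na is True."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return _options["mode.nan_is_na"]
    return False


nan = float("nan")

default_result = is_missing(nan)        # flag on: NaN is treated as NA
with option_context("mode.nan_is_na", False):
    opt_in_result = is_missing(nan)     # flag off: NaN is an ordinary float
restored_result = is_missing(nan)       # old value restored after the block
```

Routing every NaN check through one predicate like this is one way to keep the two modes from drifting apart; whether the PR centralizes the check this way is not stated in the thread.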

@jbrockmendel jbrockmendel changed the title POC: consistent NaN treatment for pyarrow dtypes API: consistent NaN treatment for pyarrow dtypes Jul 31, 2025